Avoiding Bias when Aggregating Relational Data with Degree Disparity

Authors

  • David D. Jensen
  • Jennifer Neville
  • Michael Hay
Abstract

A common characteristic of relational data sets, degree disparity, can lead relational learning algorithms to discover misleading correlations. Degree disparity occurs when the frequency of a relation is correlated with the values of the target variable. In such cases, the aggregation functions used by many relational learning algorithms will produce misleading correlations and add complexity to models. We examine this problem through a combination of simulations and experiments. We show how two novel hypothesis testing procedures can adjust for the effects of using aggregation functions in the presence of degree disparity.
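
The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' experimental setup and does not implement their hypothesis tests. The degree ranges, the uniform item values, and the names `simulate` and `pearson` are all assumptions chosen for illustration. Item values are drawn from the same distribution for both classes, so any correlation between an aggregate and the class label comes from degree alone.

```python
import random
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))


def simulate(n_objects=2000, seed=0):
    """Correlate SUM, MAX and MEAN aggregates with a binary class label.

    Hypothetical setup: positive objects link to 8-12 related items,
    negative objects to 1-5 (this is the degree disparity). The item
    values themselves carry no information about the class.
    """
    rng = random.Random(seed)
    labels, sums, maxes, means = [], [], [], []
    for _ in range(n_objects):
        label = rng.randint(0, 1)
        # Degree depends on the class label.
        degree = rng.randint(8, 12) if label else rng.randint(1, 5)
        # Values are drawn identically for both classes.
        values = [rng.random() for _ in range(degree)]
        labels.append(label)
        sums.append(sum(values))
        maxes.append(max(values))
        means.append(mean(values))
    return {
        "sum": pearson(sums, labels),
        "max": pearson(maxes, labels),
        "mean": pearson(means, labels),
    }


correlations = simulate()
for name, r in correlations.items():
    print(f"{name:>4}: r = {r:+.3f}")
```

Under these assumptions, SUM and MAX correlate positively with the class even though individual item values are uninformative, while MEAN stays close to zero. A learner that scores aggregated features without accounting for degree could therefore select a feature whose apparent signal is only a proxy for degree.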

Similar resources

Aggregating Multiple Instances in Relational Database Using Semi-Supervised Genetic Algorithm-based Clustering Technique

In solving the classification problem in relational data mining, traditional methods, for example, the C4.5 and its variants, usually require data transformations from datasets stored in multiple tables into a single table. Unfortunately, we may lose some information when we join tables with a high degree of one-to-many association. Therefore, data transformation becomes a tedious trial-and-err...

Summarizing Relational Data Using Semi-Supervised Genetic Algorithm-Based Clustering Techniques

Problem statement: In solving a classification problem in relational data mining, traditional methods, for example, the C4.5 and its variants, usually require data transformations from datasets stored in multiple tables into a single table. Unfortunately, we may lose some information when we join tables with a high degree of one-to-many association. Therefore, data transformation becomes a tedi...

Framing Bias in the Interpretation of Quality Improvement Data: Evidence From an Experiment

Background: A growing body of public management literature sheds light on potential shortcomings of quality improvement (QI) and performance management efforts. These challenges stem from heuristics individuals use when interpreting data. Evidence from studies of citizens suggests that individuals’ evaluation of data is influenced by the linguistic framing or context of that information an...

Bias in Markov models of disease.

We examine bias in Markov models of diseases, including both chronic and infectious diseases. We consider two common types of Markov disease models: ones where disease progression changes by severity of disease, and ones where progression of disease changes in time or by age. We find sufficient conditions for bias to exist in models with aggregated transition probabilities when compared to mode...

Ensemble Classification for Relational Domains

Ensemble classification methods have been shown to produce more accurate predictions than the base component models (Bauer and Kohavi 1999). Due to their effectiveness, ensemble approaches have been applied in a wide range of domains to improve classification. The expected prediction error of classification models can be decomposed into bias and variance (Friedman 1997). Ensemble methods that i...

Publication date: 2003